BG2: Bayesian variable selection in generalized linear mixed models with nonlocal priors for non-Gaussian GWAS data

Background Genome-wide association studies (GWASes) aim to identify single nucleotide polymorphisms (SNPs) associated with a given phenotype. A common approach for the analysis of GWAS is single marker analysis (SMA) based on linear mixed models (LMMs). However, LMM-based SMA usually yields a large number of false discoveries and cannot be directly applied to non-Gaussian phenotypes such as count data. Results We present a novel Bayesian method to find SNPs associated with non-Gaussian phenotypes. To that end, we use generalized linear mixed models (GLMMs) and, thus, call our method Bayesian GLMMs for GWAS (BG2). To deal with the high dimensionality of GWAS analysis, we propose novel nonlocal priors specifically tailored for GLMMs. In addition, we develop related fast approximate Bayesian computations. BG2 uses a two-step procedure: first, BG2 screens for candidate SNPs; second, BG2 performs model selection that considers all screened candidate SNPs as possible regressors. A simulation study shows favorable performance of BG2 when compared to GLMM-based SMA. We illustrate the usefulness and flexibility of BG2 with three case studies on cocaine dependence (binary data), alcohol consumption (count data), and number of root-like structures in a model plant (count data). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05468-w.

Denote the current estimate of the mean vector µ µ µ evaluated at β β β c , β β β s , α α α 1 and α α α 2 by µ µ µ.That is, In addition, denote the current estimate of the covariance matrix of ϵ ϵ ϵ evaluated at β β β c , β β β s , α α α 1 and α α α 2 by V , that is Reorganize Equation (S.1) by moving µ µ µ to the left side, pre-multiplying V −1 and then moving X c β β β c , X s β β β s , α α α 1 and α α α 2 to the left side.Let y ⋆ y ⋆ y ⋆ be equal to the left side of the resulting equation, that is The vector y ⋆ y ⋆ y ⋆ is known as the vector of adjusted observations, which is computed as a function of the current estimates of fixed effects β β β c and β β β s , and random effects α α α 1 and α α α 2 .The right side of the resulting equation is X c β β β c + X s β β β s + α α α 1 + α α α 2 + V −1 ϵ ϵ ϵ.Hence, we obtain the following approximate model for the adjusted observations Further, assuming that V −1 V V −1 ≈ V −1 and applying properties of expectation and variance, we get If we further assume that ϵ ϵ ϵ has an approximate normal distribution, then As we explain below, estimates of β β β c , β β β s , α α α 1 , α α α 2 , κ 1 and κ 2 are updated iteratively.After convergence, we have the final vector of adjusted observations y ⋆ y ⋆ y ⋆ and the approximate LMM The closed form of the likelihood function with respect to the unknown parameters is Let κ 1 and κ 2 be current estimates of the variance components.We update β β β c and β β β s with the conditional posterior mean of , where X = (X c , X s ) and H = κ 1 Σ+ κ 2 I + V −1 .Given current estimates β β β c and β β β s , we use the conditional posterior mean to update the estimates of α α α 1 and α α α 2 , that is Algorithm 1 Pseudo-likelihood approach for baseline models procedure Pseudo likelihood(y y y, X c , X s ) Initial values: 2 = 0. Calculate µ µ µ (0) , V (0) , H (0) and y ⋆ y ⋆ y ⋆(0) .while β β β c , β β β s , κ 1 and κ 2 not converge do β β β And then, to update the estimates of κ 1 and κ 2 , we maximize the profile pseudo-likelihood ).The pseudo-likelihood algorithm proceeds iteratively updating the parameters until convergence.Algorithm 1 summarizes the pseudo-likelihood approach for baseline models.

S1.2. Model fitting for non-baseline models
In each BG2 step, a baseline model is fitted with the pseudo-likelihood approach presented in Section 1.1.This results in estimated variance parameters and a vector of adjusted observations y ⋆ y ⋆ y ⋆ .After that, these variance parameter estimates and vector of adjusted observations y ⋆ y ⋆ y ⋆ are used to fit the non-baseline models in that BG2 step.As a result, fitting a non-baseline model does not have any iteration, but just computes the estimate of the regression coefficients with the formula The matrix H, which is a function of the variance parameters, is estimated with the baseline model at the beginning of the respective BG2 step, and then it remains fixed for all non-baseline models.Hence, the eigen decomposition of H can be computed at the beginning of the BG2 step and the same linear algebra trick used in EMMAX can be used in BG2.Therefore, fitting non-baseline models is super fast.
The baseline model differs in the screening step and in the model selection step.In the screening step, the baseline model is the null model with no SNPs, that is, X s is not included in the null model.In the model selection step, the baseline model is the full model with X s containing all candidate SNPs identified in the screening step.S4.Robustness of BG2 when dealing with imbalanced binary data or highly skewed count data In the original simulation studies for count data presented in Section 4.2, we have skewed count data.To visualize that, Figure S5(a) shows that when β = 0.1 the count data are skewed.In addition, Figure S5(b) shows that when β = 0.7 the count data are tremendously skewed.Table S1 shows that skewed count data do not affect variable selection.As a matter of fact, when BG2 is applied to more skewed data (β = 0.7) the performance of BG2 improves with higher TP, lower FP and FDR, and higher F1 score.The binary data we have in the original simulation study presented in Section 4.1 is almost balanced, with about 56% 0s and 44% 1s.We have added a new simulation study with β 0 = 2, which is highly imbalanced with about 29% 0s and 71% 1s.Table S2 shows that BG2 is robust to imbalanced data, and even performs slightly better when the data are imbalanced.

S5. Robustness of BG2 to genome spacing of SNPs
To verify the robustness of BG2 to genome spacing of SNPs, we have added three new simulation studies that we name SIM1, SIM2, and SIM3.These simulation studies expand the simulation study from Section 4.1 for binary data based on human genome.
In the first simulation study SIM1, we generate data from 20 causal SNPs, which are from 4 clusters.In each cluster, there are 5 causal SNPs.Each cluster has a length of 30000 bp.In two clusters, two SNPs have large coefficient β = 1.In another two clusters, three SNPs have large coefficient β = 1.All the other SNPs have small coefficient 0.2 or −0.2.For reference, simulation study SIM0 in the Table S3 is the simulation study in Section 4.1, which has the same parameter setting except that 20 SNPs are evenly spaced.
In the second simulation study SIM2, we generate data from 10 causal SNPs, which are from 2 clusters.In each cluster, there are 5 causal SNPs.Each cluster has a length of 30000 bp.In one cluster, two SNPs have large coefficient β = 1.In another two clusters, three SNPs have large coefficient β = 1.All the other SNPs have small coefficient 0.2 or −0.2.
In the third simulation study SIM3, we generate data from 5 causal SNPs, which are from only one cluster.The length of the cluster is 30000 bp.Three SNPs have large coefficient β = 1.One SNP has coefficient 0.2, and another SNP has coefficient −0.2.
Table S3 shows that BG2 can detect almost all SNPs with large coefficient.The number of causal SNPs and the position of SNPs do not alter the performance of the method BG2.Comparing SIM0 and SIM1, BG2 for clustering causal SNPs has lower FP and FDR, and higher F1.Comparing SIM1, SIM2, and SIM3, small number of causal SNPs make BG2 have lower FDR and higher F1 score.

S6. Sensitivity of BG2 to parameter values
To study the sensitivity of BG2 to the values of parameters, we have added three new simulation studies.The first of these simulation studies is SIM4 where, instead of 0.2 or -0.2, the regression coefficients for 10 causal SNPs are 0.4 and -0.4.Other 10 causal SNPs' coefficient β have six parameter settings: 0.2, 0.3, 0.4, 0.5, 0.7 and 1. Intercept β 0 = −0.5.Variance component for kinship random effects κ = 0.15. Figure S6 presents results for the SIM4 simulation study.Compared with Figure 1, Figure S6 shows that when all 20 causal SNPs' coefficient are equal to 0.4, BG2 can detect about 16 causal SNPs.Otherwise, BG2 can detect about 10 SNPs with relative large coefficient.In addition, BG2 has higher F1 score in Figure S6 than in Figure 1.

S7. Calibration of the pseudo-likelihood approach
The BG2 approach does not provide p-values.To check if the pseudo-likelihood approach is calibrated, we compute p-values based on the pseudo-likelihood approach for two datasets from Sections 4.1 and 4.2 that do not have any causal SNP.In this case, if the pseudo-likelihood approach is calibrated then the distribution of the p-values should be a uniform distribution.Figure S9 presents a Q-Q plot of p-values from one binary dataset in Section 4.1 whereas Figure S10 presents a Q-Q plot of p-values from one count dataset in Section 4.2.It is clear from the figures that in both cases the p-values have a uniform distribution.Therefore, the pseudo-likelihood approach is calibrated.S8.Histograms of the response variables in the case studies Figures S11, S12, and S13 present the histograms of the response variables for each of the three case studies.From these figures, it is clear that the count response variables in the case studies presented in Sections 5.1 and 5.3 are skewed.However, the new Section S4 from the Supplementary Material shows that BG2 can deal with imbalanced data without difficulties, and that the performance of BG2 improves as the level of skewness increases.In addition, the binary response variable from the case study presented in Section 5.2 is imbalanced.However, the new Section S4 shows that BG2 is robust to imbalanced data, and even performs slightly better when the data are imbalanced.

Figure S9 :
Figure S9: Calibration of the pseudo-likelihood approach.Binary data simulated from human genome data.Q-Q plot of p-values based the pseudo-likelihood approach.

Figure S10 :
Figure S10: Calibration of the pseudo-likelihood approach.Count data simulated from A. Thaliana genome data.Q-Q plot of p-values based the pseudo-likelihood approach.

Figure S11 :
Figure S11: Case study from Section 5.1: Maximum number of alcoholic drinks.Histogram of the maximum number of alcoholic drinks.

Figure S12 :
Figure S12: Case study from Section 5.2: Cocaine dependence.Histogram of the response variable cocaine dependence that is equal to 1 if the subject is cocaine dependent and 0 otherwise.

Figure S13 :
Figure S13: Case study from Section 5.3: Root-like structures in A. Thaliana.Histogram of number of root-like structures in A. Thaliana.

Table S1 :
Performance of BG2 and SMA when data are skewed or tremendously skewed.When β = 0.1, count data are skewed.When β = 0.7, count data are tremendously skewed.

Table S3 :
Robustness to genome spacing of SNPs.Performance of BG2 and SMA when the number of causal SNPs decreases and the SNPs are clustered.SIM0: 20 evenly spaced causal SNPs.SIM1: 20 causal SNPs in four clusters.SIM2: 10 causal SNPs in two clusters.SIM3: 5 causal SNPs in one cluster.